
Distilled Sensing: Adaptive Sampling for 
Sparse Detection and Estimation 

Jarvis Haupt, Rui Castro, and Robert Nowak 
Abstract 

Adaptive sampling results in dramatic improvements in the recovery of sparse signals in white Gaussian noise. 
A sequential adaptive sampling-and-refinement procedure called Distilled Sensing (DS) is proposed and analyzed. 
DS is a form of multi-stage experimental design and testing. Because of the adaptive nature of the data collection, 
DS can detect and localize far weaker signals than possible from non-adaptive measurements. In particular, reliable 
detection and localization (support estimation) using non-adaptive samples is possible only if the signal amplitudes 
grow logarithmically with the problem dimension. Here it is shown that using adaptive sampling, reliable detection 
is possible provided the amplitude exceeds a constant, and localization is possible when the amplitude exceeds any 
arbitrarily slowly growing function of the dimension. 

I. Introduction 

In high dimensional multiple hypothesis testing problems the aim is to identify the subset of the hypotheses 
that differ from the null distribution, or simply to decide if one or more of the hypotheses do not follow the null. 
There is now a well developed theory and methodology for this problem, and the fundamental limitations in the 
high dimensional setting are quite clear. However, most existing treatments of the problem assume a non-adaptive
measurement process. The question of how the limitations might differ under a more flexible, sequential adaptive 
measurement process has not been addressed. This paper shows that this additional flexibility can yield surprising 
and dramatic performance gains. 

For concreteness let $x = (x_1, \dots, x_p) \in \mathbb{R}^p$ be an unknown sparse vector, such that most (or all) of its components
$x_i$ are equal to zero. The locations of the non-zero components are arbitrary. This vector is observed in additive
white Gaussian noise and we consider two problems:

Localization: Infer the locations of the few non-zero components.

Detection: Decide whether $x$ is the all-zero vector.

Given a single, non-adaptive noisy measurement of $x$, a common approach entails coordinate-wise thresholding of
the observed data at a given level, identifying the number and locations of entries for which the corresponding
observation exceeds the threshold. In such settings there are sharp asymptotic thresholds that the magnitude of the
non-zero components must exceed in order for the signal to be localizable and/or detectable. Such characterizations
have been given in various contexts in [1]-[3] for the localization problem and [4]-[6] for the detection problem.
A more thorough review of these sorts of characterizations is given in Section II.

In this paper we investigate these problems under a more flexible measurement process. Suppose we are able 
to sequentially collect multiple noisy measurements of each component of x, and that the data so obtained can be 
modeled as 

$$y_{i,j} = x_i + \gamma_{i,j}^{-1/2} w_{i,j}, \quad i = 1, \dots, p, \quad j = 1, \dots, k. \qquad (1)$$

In the above a total of $k$ measurement steps is taken, $j$ indexes the measurement step, $w_{i,j} \overset{iid}{\sim} \mathcal{N}(0, 1)$ are zero-mean
Gaussian random variables with unit variance, and $\gamma_{i,j} \ge 0$ quantifies the precision of each measurement.
When $\gamma_{i,j} = 0$ we adopt the convention that component $x_i$ was not observed at step $j$. The crucial feature of
this model is that it does not preclude sequentially adaptive measurements, where the $\gamma_{i,j}$ can depend on past
observations $\{y_{i,\ell}\}_{i \in \{1, \dots, p\},\, \ell < j}$.

In practice, the precision for a measurement at location i at step j may be controlled, for example, by collecting 
multiple independent samples and averaging to reduce the effective observation noise, the result of which would be 
an observation described by the model (1). In this case, the parameters $\{\gamma_{i,j}\}$ can be thought of as proportional to
the number of samples collected at location $i$ at step $j$. For exposure-based sampling modalities common in many
imaging scenarios, the precision parameters $\{\gamma_{i,j}\}$ can be interpreted as being proportional to the length of time
for which the component at location i is observed at step j. 

In order to make fair comparisons to non-adaptive measurement processes, the total precision budget is limited 
in the following way. Let $R(p)$ be an increasing function of $p$, the dimension of the problem (that is, the number
of hypotheses under scrutiny). The precision parameters $\{\gamma_{i,j}\}$ are required to satisfy

$$\sum_{j=1}^{k} \sum_{i=1}^{p} \gamma_{i,j} \le R(p). \qquad (2)$$

For example, the usual non-adaptive, single measurement model corresponds to taking $R(p) = p$, $k = 1$, and
$\gamma_{i,1} = 1$ for $i = 1, \dots, p$. This baseline can be compared with adaptive procedures by keeping $R(p) = p$, but
allowing $k > 1$ and variables $\{\gamma_{i,j}\}$ satisfying (2).
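The budget constraint can be checked mechanically. The following is a minimal sketch, assuming the precisions are stored as one list per measurement step; the helper name `within_budget` and the example allocations are hypothetical and not part of the paper.

```python
def within_budget(gamma, budget):
    """Return True if the total precision, summed over steps and components, is at most `budget`.

    gamma[j][i] is the non-negative precision of component i at step j;
    a value of 0 means the component was not observed at that step.
    """
    if any(g < 0 for step in gamma for g in step):
        raise ValueError("precisions must be non-negative")
    return sum(sum(step) for step in gamma) <= budget


p = 8
# Non-adaptive baseline: a single step with unit precision everywhere, R(p) = p.
baseline = [[1.0] * p]
# A two-step allocation with the same budget: half of it spread over all
# components, the other half concentrated on four retained components.
two_step = [[0.5] * p, [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]]

print(within_budget(baseline, p), within_budget(two_step, p))
```

Both allocations spend the same total precision, which is what makes the comparison between adaptive and non-adaptive procedures fair in the sense described above.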

The multiple measurement process (1) is applicable in many interesting and relevant scenarios. For example, in
gene association and expression studies, two-stage approaches are gaining popularity (see [7]-[9] and references
therein): in the first stage a large number of genes is initially tested to identify a promising subset of them, and in the
second stage these promising genes are subject to further testing. Such ideas have been extended to multiple-stage
approaches; see, for example, [10]. Similar two-stage approaches have also been examined in the signal processing
literature; see [11]. More broadly, sequential experimental design has been popular in other fields as well, such as
in computer vision, where it is known as active vision [12], or in machine learning, where it is known as active
learning [13], [14]. These types of procedures can potentially impact other areas such as microarray-based studies
and astronomical surveying.

The main contribution of this paper is a theoretical analysis that reveals the dramatic gains that can be attained 
using such sequential procedures. Our focus here is on a particular sequential, adaptive sampling procedure called 
Distilled Sensing (DS). The idea behind DS is simple: use a portion of the precision budget to crudely measure all 
components; eliminate a fraction of the components that appear least promising from further consideration after this 
measurement; and iterate this procedure several times, at each step measuring only components retained after the 
previous step. As mentioned above, similar procedures have been proposed in the context of experimental design;
however, to the best of our knowledge the quantification of performance gains had not been established prior to our
own initial work in [15], [16] and the results established in this paper. In this manuscript we significantly extend our
previous results by providing stronger results for the localization problem, and an entirely novel characterization of 
the detection problem. 

This paper is organized as follows. Following a brief review of the fundamental limits of non-adaptive sampling for
detection and localization in Section II, our main result (that DS can reliably solve the localization and detection
problems for dramatically weaker signals than what is possible using non-adaptive measurements) is stated in
Section III. A proof of the main result is given in Section IV. Simulation results demonstrating the theory are
provided in Section V, and conclusions and extensions are discussed in Section VI. A proof of the threshold for
localization from non-adaptive measurements and several auxiliary lemmas are provided in the appendices.

II. Review of Non-adaptive Localization and Detection of Sparse Signals 

In this section we review the known thresholds for localization and detection from non-adaptive measurements. 
As mentioned above, such thresholds have been established in a variety of problem settings [1]-[6]. Here we provide
a concise summary of the main ideas along with supporting proofs as needed, to facilitate comparison with our 
main results concerning recovery from adaptive measurements which appear in the next section. 

The non-adaptive measurement model we will consider as the baseline for comparison is as follows. We have a 
single observation of x in noise: 

$$y_i = x_i + w_i, \quad i = 1, \dots, p, \qquad (3)$$

where $w_i \overset{iid}{\sim} \mathcal{N}(0, 1)$. As noted above, this is a special case of our general setup (1) in which $k = 1$ and $\gamma_{i,1} = 1$
for $i = 1, \dots, p$. This implies a precision budget $R(p) = \sum_{i=1}^{p} \gamma_{i,1} = p$.

To describe the asymptotic (large $p$) thresholds for localization we need to introduce some notation. Define the
false-discovery proportion (FDP) and non-discovery proportion (NDP) as follows.

Definition II.1. Let $S := \{i : x_i \neq 0\}$ denote the signal support set and let $\hat{S} = \hat{S}(y)$ denote an estimator of $S$.
The false-discovery proportion is

$$\mathrm{FDP}(\hat{S}) := \frac{|\hat{S} \setminus S|}{|\hat{S}|}.$$

In words, the FDP of $\hat{S}$ is the ratio of the number of components falsely declared as non-zero to the total number
of components declared non-zero. The non-discovery proportion is

$$\mathrm{NDP}(\hat{S}) := \frac{|S \setminus \hat{S}|}{|S|}.$$

In words, the NDP of $\hat{S}$ is the ratio of the number of non-zero components missed to the number of actual non-zero
components.
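Both error metrics are simple set computations. The sketch below is one way to evaluate them; the function names and the conventions for empty sets (returning 0 when nothing is declared, or when the true support is empty) are assumptions made here, since the definitions above leave those cases unspecified.

```python
def fdp(estimate, support):
    """False-discovery proportion: falsely declared indices over all declared indices."""
    estimate, support = set(estimate), set(support)
    if not estimate:
        return 0.0  # assumed convention: no discoveries means no false discoveries
    return len(estimate - support) / len(estimate)


def ndp(estimate, support):
    """Non-discovery proportion: missed true indices over all true indices."""
    estimate, support = set(estimate), set(support)
    if not support:
        return 0.0  # assumed convention for an all-zero signal
    return len(support - estimate) / len(support)


true_support = {2, 5, 7, 11}
declared = {2, 5, 9}
print(fdp(declared, true_support))  # one of the three declared indices is false
print(ndp(declared, true_support))  # two of the four true indices are missed
```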

In this paper we focus in particular on the scenario where $x_i \ge 0$ for all $i \in \{1, \dots, p\}$. We elaborate on possible
extensions in Section VI. Under this assumption it is quite natural to focus on a specific class of estimators of $S$.

Definition II.2. A coordinate-wise thresholding procedure is an estimator of the following form: 

$$\hat{S}_\tau(y) := \{i \in \{1, \dots, p\} : y_i > \tau > 0\},$$
where the threshold $\tau$ may depend implicitly on $x$, or on $y$ itself.
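A coordinate-wise thresholding procedure of this form takes only a few lines. The sketch below uses a fixed, user-supplied threshold and 1-based indices to match the text; the function name `threshold_support` and the sample data are illustrative assumptions.

```python
def threshold_support(y, tau):
    """Return the (1-based) indices whose observation strictly exceeds tau."""
    if tau <= 0:
        raise ValueError("the definition requires a positive threshold")
    return {i for i, y_i in enumerate(y, start=1) if y_i > tau}


observations = [0.3, 2.9, -1.2, 3.4, 0.8]
print(threshold_support(observations, tau=2.0))
```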

The following result establishes the limits of localization using non-adaptive sampling. A proof is provided in 
Appendix A.

Theorem II.3. Assume $x$ has $p^{1-\beta}$, $\beta \in (0, 1)$, non-zero components of amplitude $\sqrt{2r \log p}$, $r > 0$, and
measurement model (3). There exists a coordinate-wise thresholding procedure that yields an estimator $\hat{S} = \hat{S}(y)$
such that if $r > \beta$, then as $p \to \infty$,

$$\mathrm{FDP}(\hat{S}) \overset{P}{\to} 0, \quad \mathrm{NDP}(\hat{S}) \overset{P}{\to} 0,$$

where $\overset{P}{\to}$ denotes convergence in probability. Moreover, if $r < \beta$, then there does not exist a coordinate-wise
thresholding procedure that can guarantee that both quantities above tend to 0 as $p \to \infty$.

We also refer the reader to recent related work in [3], which considered localization under similar error metrics
as those utilized here. There it was shown, using a random signal model and assuming observations in the form of
noisy independent random (Gaussian) linear combinations of the entries of $x$, that similar sharp asymptotics hold
for any recovery procedure [3, Thm. 5].

Random signal models have also been adopted in the examination of the fundamental limits of signal detection
[4]-[6]. In particular, suppose that $x$ is such that its entries $x_i$ have amplitude $\mu(p) = \sqrt{2r \log p}$ independently with
probability $\epsilon(p) = p^{-\beta}$, and amplitude zero with probability $1 - \epsilon(p)$. The problem of signal detection from noisy
observations collected according to the measurement model (3) amounts to a hypothesis test of the form:

$$H_0 : y_i \overset{iid}{\sim} \mathcal{N}(0, 1), \quad i = 1, \dots, p,$$
$$H_1 : y_i \overset{iid}{\sim} (1 - \epsilon(p))\,\mathcal{N}(0, 1) + \epsilon(p)\,\mathcal{N}(\mu(p), 1), \quad i = 1, \dots, p. \qquad (4)$$

Note that under the alternative hypothesis, the signal has $p^{1-\beta}$ non-zero components in expectation. We recall the
following result [4]-[6].

Theorem II.4. Consider the hypotheses in (4) where $\mu(p) = \sqrt{2r \log p}$. Define

$$\rho(\beta) := \begin{cases} 0, & 0 < \beta \le 1/2, \\ \beta - 1/2, & 1/2 < \beta \le 3/4, \\ (1 - \sqrt{1 - \beta})^2, & 3/4 < \beta < 1. \end{cases}$$

If $r > \rho(\beta)$, then there exists a test for which the sum of the false alarm and miss probabilities tends to 0 as
$p \to \infty$. Conversely, if $r < \rho(\beta)$, then for any test the sum of the false alarm and miss probabilities tends to 1 as
$p \to \infty$.
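To make the detection threshold concrete, the sketch below evaluates the boundary as a function of the sparsity exponent. The function name `detection_boundary` is hypothetical, and the third branch, $(1 - \sqrt{1 - \beta})^2$, is taken here as the standard form of this boundary from the sparse-mixture detection literature; treat it as an assumption of the example.

```python
import math


def detection_boundary(beta):
    """Non-adaptive detection boundary rho(beta) for sparsity exponent beta in (0, 1)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    if beta <= 0.5:
        return 0.0
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2


for beta in (0.4, 0.6, 0.75, 0.9):
    # Localization requires r > beta, so the detection boundary sits below beta.
    print(beta, detection_boundary(beta))
```

The two non-trivial branches agree at $\beta = 3/4$, and the boundary always lies below the localization threshold $r > \beta$, reflecting that detection is the easier of the two problems.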

It is possible to relate these detection results to the deterministic sparsity model that we consider here, using the 
ideas presented in [17, Chapter 8].

III. Main Results: Adaptive Localization and Detection of Sparse Signals 

In this section we present the main results of our theoretical analysis of Distilled Sensing (DS). Algorithm 1
describes the DS measurement process. At each step of the process, we retain only the components with non-negative
observations. This means that when the number of non-zero components is very small, roughly half of the
components are eliminated from further consideration at each step. Consequently, if the precision budget allocated 
at each step is slightly larger than 1/2 of that used in the preceding step, then the effective precision of the 
measurements made at each step is increasing. In particular, if the budget for each step is 1/2 + c of the budget 
at the previous step, for some small constant c > 0, then the precision of the measured components is increasing 
exponentially. Therefore, the key is to show that the very crude thresholding at each step does not remove a
significant number of the non-zero components. One final observation is that because the number of components 
measured decreases by a factor of roughly 1/2 at each step, the total number of measurements made by DS is 
roughly 2p, a modest increase relative to the p measurements made in the non-adaptive setting. 
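The two counting arguments in this paragraph can be checked numerically. The sketch below assumes exactly half of the components survive each step, which is an idealization of the sparse case, and uses hypothetical helper names and an arbitrary first-step budget of $p/4$.

```python
def expected_measurements(p, k):
    """Approximate number of measurements when half of the indices survive each step."""
    return sum(p / 2 ** (j - 1) for j in range(1, k + 1))


def per_component_precision(p, r1, ratio, k):
    """Precision per retained component at each step.

    Step j spends r1 * ratio**(j - 1) on roughly p / 2**(j - 1) components.
    """
    return [(r1 * ratio ** (j - 1)) / (p / 2 ** (j - 1)) for j in range(1, k + 1)]


p, k = 2 ** 14, 6
total = expected_measurements(p, k)  # stays below 2p
precisions = per_component_precision(p, r1=p / 4, ratio=0.75, k=k)
growth = [b / a for a, b in zip(precisions, precisions[1:])]  # each close to 2 * 0.75
print(total, growth)
```

With a budget ratio of $1/2 + c$ between consecutive steps, the per-component precision grows by a factor of $1 + 2c$ per step, which is the exponential growth referred to above.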

Recall from above that for non-adaptive sampling, reliable detection and localization is only possible provided
the signal amplitude is $\Omega(\sqrt{\log p})$. In other words, the signal amplitude must exceed a constant (that depends on
the sparsity level) times $\sqrt{\log p}$. The following theorem establishes that DS is capable of detecting and localizing
much weaker sparse signals. For the purposes of our investigation we assume that the non-zero components are
positive. It is trivial to extend the algorithm and its analysis to handle both positive and negative components
by simply repeating the entire process twice; once as described, and again with $y_{i,j}$ replaced with $-y_{i,j}$ in the
refinement step of Algorithm 1.

Algorithm 1: Distilled Sensing.

Input:
    Number of observation steps: $k$;
    Resource allocation sequence $\{R_j\}_{j=1}^{k}$ satisfying $\sum_{j=1}^{k} R_j \le R(p)$;
Initialize:
    Initial index set: $I_1 \leftarrow \{1, 2, \dots, p\}$;
Distillation:
    for $j = 1$ to $k$ do
        Allocate resources: $\gamma_{i,j} = R_j / |I_j|$ for $i \in I_j$, and $\gamma_{i,j} = 0$ for $i \notin I_j$;
        Observe: $y_{i,j} = x_i + \gamma_{i,j}^{-1/2} w_{i,j}$, $i \in I_j$;
        Refine: $I_{j+1} \leftarrow \{i \in I_j : y_{i,j} > 0\}$;
    end
Output:
    Final index set: $I_k$;
    Distilled observations: $y_k = \{y_{i,k} : i \in I_k\}$;

Theorem III.1. Assume $x \ge 0$ with $p^{1-\beta}$, $\beta \in (0, 1)$, non-zero components of amplitude $\mu(p)$, and sequential
measurement model (1) using Distilled Sensing with $k = k(p) = \max\{\lceil \log_2 \log p \rceil, 0\} + 2$, and precision budget
distributed over the measurement steps so that $\sum_{j=1}^{k} R_j \le p$, $R_{j+1}/R_j \ge \delta > 1/2$, and $R_1 = c_1 p$ and $R_k = c_k p$
for some $c_1, c_k \in (0, 1)$. Then the support set estimator constructed using the output of the DS algorithm

$$\hat{S}_{DS} := \{i \in I_k : y_{i,k} > \sqrt{2/c_k}\}$$

has the following properties:

(i) if $\mu(p) \to \infty$ as a function of $p$, then as $p \to \infty$

$$\mathrm{FDP}(\hat{S}_{DS}) \overset{P}{\to} 0, \quad \mathrm{NDP}(\hat{S}_{DS}) \overset{P}{\to} 0;$$

(ii) if $\mu(p) \ge \max\{\sqrt{4/c_1},\, 2\sqrt{2/c_k}\}$ (a constant) then

$$\lim_{p \to \infty} \Pr(\hat{S}_{DS} = \emptyset) = \begin{cases} 1, & \text{if } x = 0, \\ 0, & \text{if } x \neq 0, \end{cases}$$

where $\emptyset$ is the empty set.

In words, this result states that DS successfully identifies the sparse signal support provided only that the signal
amplitude grows (arbitrarily slowly) as a function of the problem dimension $p$, while reliable signal detection
requires only that the signal amplitude exceed a constant. The result (ii) is entirely novel, and (i) improves on
our initial result in [16], which required $\mu(p)$ to grow faster than an arbitrary iteration of the logarithm (i.e.,
$\mu(p) \sim \log \log \cdots \log p$). Comparison with the $\Omega(\sqrt{\log p})$ amplitude required for both tasks using non-adaptive
sampling illustrates the dramatic gains that are achieved through adaptivity.
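The procedure is short enough to simulate directly. The following is a minimal sketch, not the authors' implementation: it assumes the step budget is split equally over the retained indices (so each observation at step $j$ has variance $|I_j|/R_j$), uses 0-based indices, and picks an illustrative budget sequence (geometric decay with ratio 0.75, last step equal to the first). The function names, the signal amplitude, and the seed are all assumptions made for the demo.

```python
import math
import random


def distilled_sensing(x, budgets, rng):
    """Distillation loop for a signal x (list of floats).

    budgets[j] is the precision spent at step j + 1, split equally over the
    retained indices, so every observation has variance |I_j| / R_j.
    Returns the final index set and the last-step observations as a dict.
    """
    retained = list(range(len(x)))
    observations = {}
    for step, budget in enumerate(budgets):
        sigma = math.sqrt(len(retained) / budget)
        observations = {i: x[i] + sigma * rng.gauss(0.0, 1.0) for i in retained}
        if step < len(budgets) - 1:
            # Refinement: keep only indices with positive observations.
            retained = [i for i in retained if observations[i] > 0]
    return retained, observations


def ds_support_estimate(x, budgets, rng):
    """Threshold the distilled observations at sqrt(2 / c_k), where R_k = c_k * p."""
    c_k = budgets[-1] / len(x)
    retained, observations = distilled_sensing(x, budgets, rng)
    tau = math.sqrt(2.0 / c_k)
    return {i for i in retained if observations[i] > tau}


# Illustrative setup; k = 6 matches max(ceil(log2(log p)), 0) + 2 for p = 2**12.
p, k, mu = 2 ** 12, 6, 6.0
weights = [0.75 ** j for j in range(k - 1)] + [1.0]
budgets = [p * w / sum(weights) for w in weights]  # total precision equals p
support = set(range(0, p, 64))                     # 64 non-zero entries
x = [mu if i in support else 0.0 for i in range(p)]

rng = random.Random(0)
estimate = ds_support_estimate(x, budgets, rng)
null_estimate = ds_support_estimate([0.0] * p, budgets, rng)
print(len(estimate & support), len(estimate - support), len(null_estimate))
```

In this configuration the amplitude 6 is above both constants in part (ii) of the theorem but below $\sqrt{2 \log p} \approx 4.1$ only marginally relevant for such a small $p$; the point of the sketch is the mechanics of the loop rather than a reproduction of the asymptotic regime.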



IV. Analysis of Distilled Sensing 

In this section we prove the main result characterizing the performance of Distilled Sensing (DS), Theorem III.1.
We begin with three lemmas that quantify the finite sample behavior of DS. 

A. Distillation: Reject the Nulls, Retain the Signal 

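Lemma IV.1. If $\{y_i\}_{i=1}^{m} \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$, $\sigma > 0$, then for any $0 < \epsilon < 1/2$,

$$\left(\frac{1}{2} - \epsilon\right) m \le \left|\{i \in \{1, \dots, m\} : y_i > 0\}\right| \le \left(\frac{1}{2} + \epsilon\right) m$$

with probability at least $1 - 2\exp(-2m\epsilon^2)$.

Proof: For any event $A$, let $\mathbf{1}_A$ be the indicator taking the value 1 if $A$ is true and 0 otherwise. By Hoeffding's
inequality, for any $\epsilon > 0$,

$$\Pr\left(\left|\sum_{i=1}^{m} \mathbf{1}_{\{y_i > 0\}} - \frac{m}{2}\right| > m\epsilon\right) \le 2\exp(-2m\epsilon^2).$$

Imposing the restriction $\epsilon < 1/2$ guarantees that the corresponding fractions are bounded away from zero and one.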
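The lemma can be sanity-checked by simulation. The sketch below compares the Hoeffding lower bound with the empirical frequency of the event over repeated draws; the function names, the noise level, the number of repetitions, and the seed are arbitrary choices for illustration.

```python
import math
import random


def retention_bound(m, eps):
    """Probability lower bound 1 - 2 exp(-2 m eps^2) for the retained fraction."""
    return 1.0 - 2.0 * math.exp(-2.0 * m * eps ** 2)


def fraction_retained(m, sigma, rng):
    """Fraction of m i.i.d. zero-mean Gaussian draws (std sigma) that are positive."""
    return sum(rng.gauss(0.0, sigma) > 0 for _ in range(m)) / m


m, eps = 2000, 0.05
rng = random.Random(1)
fractions = [fraction_retained(m, sigma=3.0, rng=rng) for _ in range(50)]
inside = sum(0.5 - eps <= f <= 0.5 + eps for f in fractions)
print(retention_bound(m, eps), inside / len(fractions))
```

The bound does not depend on $\sigma$, which is why the crude sign test removes about half of the null components regardless of the precision used at that step.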



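Lemma IV.2. Let $\{y_i\}_{i=1}^{m} \overset{iid}{\sim} \mathcal{N}(\mu, \sigma^2)$, with $\sigma > 0$ and $\mu \ge 2\sigma$. Define $\epsilon := \frac{\sigma}{\mu\sqrt{2\pi}} < 1$. Then

$$(1 - \epsilon) m \le \left|\{i \in \{1, 2, \dots, m\} : y_i > 0\}\right| \le m,$$

with probability at least $1 - \exp\left(-\frac{\mu m}{4\sigma\sqrt{2\pi}}\right)$.

Proof: We will utilize the following standard bound on the Gaussian tail: for $Z \sim \mathcal{N}(0, 1)$ and $\gamma > 0$,

$$\frac{1}{\sqrt{2\pi}\,\gamma}\left(1 - \frac{1}{\gamma^2}\right)\exp(-\gamma^2/2) \le \Pr(Z > \gamma) \le \frac{1}{\sqrt{2\pi}\,\gamma}\exp(-\gamma^2/2).$$

Let $q = \Pr(y_i > 0)$, then it follows that

$$1 - q \le \frac{\sigma}{\mu\sqrt{2\pi}}\exp\left(-\frac{\mu^2}{2\sigma^2}\right).$$

Next we use the Binomial tail bound from [18]: for any $0 < b < \mathbb{E}\left[\sum_{i=1}^{m} \mathbf{1}_{\{y_i > 0\}}\right] = mq$,

$$\Pr\left(\sum_{i=1}^{m} \mathbf{1}_{\{y_i > 0\}} \le b\right) \le \left(\frac{m - mq}{m - b}\right)^{m-b}\left(\frac{mq}{b}\right)^{b}.$$

Note that $\epsilon > 1 - q$ (or equivalently, $1 - \epsilon < q$), so we can apply this result to $\sum_{i=1}^{m} \mathbf{1}_{\{y_i > 0\}}$ with $b = (1 - \epsilon)m$
to obtain

$$\Pr\left(\sum_{i=1}^{m} \mathbf{1}_{\{y_i > 0\}} \le (1 - \epsilon)m\right) \le \left(\frac{1 - q}{\epsilon}\right)^{\epsilon m}\left(\frac{q}{1 - \epsilon}\right)^{(1-\epsilon)m} \le \exp\left(-\frac{\mu^2 \epsilon m}{2\sigma^2}\right)\left(\frac{1}{1 - \epsilon}\right)^{(1-\epsilon)m}.$$

Now, to establish the stated result, it suffices to show

$$-\frac{\mu^2 \epsilon m}{2\sigma^2} + (1 - \epsilon)\, m \log\left(\frac{1}{1 - \epsilon}\right) \le -\frac{\mu m}{4\sigma\sqrt{2\pi}},$$

which holds provided $\mu \ge 2\sigma$, since $0 < \epsilon < 1$ and $\left(\frac{1-\epsilon}{\epsilon}\right)\log\left(\frac{1}{1-\epsilon}\right) \le 1$ for $\epsilon \in (0, 1)$. ■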

B. The Output of the DS Procedure 

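Refer to Algorithm 1 and define $s_j := |S \cap I_j|$ and $z_j := |S^c \cap I_j|$, the number of non-zero and zero components
respectively, present at the beginning of step $j$, for $j = 1, \dots, k$. Let $\epsilon > 0$, and for $j = 1, \dots, k - 1$ define

$$\epsilon_j^2 := \frac{s_1 + (1/2 + \epsilon)^{j-1} z_1}{2\pi\mu^2 R_j}. \qquad (5)$$

The output of the DS procedure is quantified in the following result.

Lemma IV.3. Let $0 < \epsilon < 1/2$ and assume that $R_j \ge \frac{4}{\mu^2}\left(s_1 + (1/2 + \epsilon)^{j-1} z_1\right)$, $j = 1, \dots, k - 1$. If $|S| > 0$,
then with probability at least

$$1 - \sum_{j=1}^{k-1} \exp\left(-\frac{s_1 \prod_{\ell=1}^{j-1}(1 - \epsilon_\ell)}{2\sqrt{2\pi}}\right) - 2\sum_{j=1}^{k-1} \exp\left(-2 z_1 (1/2 - \epsilon)^{j-1}\epsilon^2\right),$$

$\prod_{\ell=1}^{j-1}(1 - \epsilon_\ell)\, s_1 \le s_j \le s_1$ and $(1/2 - \epsilon)^{j-1} z_1 \le z_j \le (1/2 + \epsilon)^{j-1} z_1$ for $j = 2, \dots, k$. If $|S| = 0$, with
probability at least

$$1 - 2\sum_{j=1}^{k-1} \exp\left(-2 z_1 (1/2 - \epsilon)^{j-1}\epsilon^2\right),$$

$(1/2 - \epsilon)^{j-1} z_1 \le z_j \le (1/2 + \epsilon)^{j-1} z_1$ for $j = 2, \dots, k$.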

Proof: The results follow from Lemmas IV.1 and IV.2 and the union bound. First assume that $s_1 = |S| > 0$.
Let $\sigma_j^2 := |I_j|/R_j = (s_j + z_j)/R_j$, and $\tilde{\epsilon}_j := \frac{\sigma_j}{\mu\sqrt{2\pi}}$, $j = 1, \dots, k$.

The argument proceeds by conditioning on the output of all prior refinement steps; in particular, suppose that
$(1 - \tilde{\epsilon}_{\ell-1}) s_{\ell-1} \le s_\ell \le s_{\ell-1}$ and $(1/2 - \epsilon) z_{\ell-1} \le z_\ell \le (1/2 + \epsilon) z_{\ell-1}$ for $\ell = 2, \dots, j$. Then apply Lemma IV.1 with
$m = z_j$, Lemma IV.2 with $m = s_j$ and $\sigma^2 = \sigma_j^2$, and the union bound to obtain that with probability at least

$$1 - \exp\left(-\frac{\mu s_j}{4\sigma_j\sqrt{2\pi}}\right) - 2\exp\left(-2 z_j \epsilon^2\right), \qquad (6)$$

$(1 - \tilde{\epsilon}_j) s_j \le s_{j+1} \le s_j$, and $(1/2 - \epsilon) z_j \le z_{j+1} \le (1/2 + \epsilon) z_j$. Note that the condition $R_j \ge \frac{4}{\mu^2}\left(s_1 + (1/2 + \epsilon)^{j-1} z_1\right)$
and the assumptions on prior refinement steps ensure that $\mu \ge 2\sigma_j$, which is required for Lemma IV.2. The condition
$\mu \ge 2\sigma_j$ also allows us to simplify the probability bound so that the event above occurs with probability at least

$$1 - \exp\left(-\frac{s_j}{2\sqrt{2\pi}}\right) - 2\exp\left(-2 z_j \epsilon^2\right).$$

Next, we can recursively apply the union bound and the bounds on $s_j$ and $z_j$ above to obtain for $j = 1, \dots, k - 1$

$$\tilde{\epsilon}_j^2 = \frac{\sigma_j^2}{2\pi\mu^2} \le \frac{s_1 + (1/2 + \epsilon)^{j-1} z_1}{2\pi\mu^2 R_j} = \epsilon_j^2,$$

with probability at least

$$1 - \sum_{j=1}^{k-1} \exp\left(-\frac{s_1 \prod_{\ell=1}^{j-1}(1 - \epsilon_\ell)}{2\sqrt{2\pi}}\right) - \sum_{j=1}^{k-1} 2\exp\left(-2 z_1 (1/2 - \epsilon)^{j-1}\epsilon^2\right).$$

Note that the condition $R_j \ge \frac{4}{\mu^2}\left(s_1 + (1/2 + \epsilon)^{j-1} z_1\right)$ implies that $\epsilon_j < 1$. The first result follows directly. If
$s_1 = |S| = 0$, then consider only $z_j$, $j = 1, \dots, k$. The result follows again by the union bound. Note that for this
statement the condition on $R_j$ is not required. ■

Now we examine the conditions $R_j \ge \frac{4}{\mu^2}\left(s_1 + (1/2 + \epsilon)^{j-1} z_1\right)$, $j = 1, \dots, k$ more closely. Define $c :=
s_1 / [(1/2 + \epsilon)^{k-1} z_1]$, in effect condensing several problem-specific parameters ($s_1$, $z_1$, and $k$) into a single scalar
parameter. Then the conditions on $R_j$ are satisfied if

$$R_j \ge \frac{4 z_1 (1/2 + \epsilon)^{j-1}}{\mu^2}\left(c\,(1/2 + \epsilon)^{k-j} + 1\right).$$

Since $z_1 \le p$, the following condition is sufficient

$$R_j \ge \frac{4 p\, (1/2 + \epsilon)^{j-1}}{\mu^2}\left(c\,(1/2 + \epsilon)^{k-j} + 1\right),$$

and in particular the more stringent condition $R_j \ge \frac{4 p\, (1/2 + \epsilon)^{j-1}}{\mu^2}(c + 1)$ will suffice. It is now easy to see that if
$s_1 \ll z_1$ (e.g., so that $c \le 1$), then the sufficient conditions become $R_j \ge \frac{8p}{\mu^2}(1/2 + \epsilon)^{j-1}$, $j = 1, \dots, k$. Thus, for
the sparse situations we consider, the precision allocated to each step must be just slightly greater than 1/2 of the
precision allocated in the previous step. We are now in position to prove the main theorem.
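The relationship between the exact condition on $R_j$ and the simplified sufficient condition can be checked numerically. The sketch below assumes the prefactors $4/\mu^2$ and $8p/\mu^2$ as read from the derivation above; the function names and the example values of $p$, $s_1$, $\mu$, and $\epsilon$ are hypothetical.

```python
def required_budget(s1, z1, mu, eps, j):
    """Condition of Lemma IV.3: (4 / mu^2) * (s1 + (1/2 + eps)^(j-1) * z1)."""
    return 4.0 / mu ** 2 * (s1 + (0.5 + eps) ** (j - 1) * z1)


def sufficient_budget(p, mu, eps, j):
    """Simplified sparse-case condition: (8 p / mu^2) * (1/2 + eps)^(j-1)."""
    return 8.0 * p / mu ** 2 * (0.5 + eps) ** (j - 1)


p, s1, mu, eps, k = 4096, 8, 3.0, 0.05, 6
z1 = p - s1
c = s1 / ((0.5 + eps) ** (k - 1) * z1)  # well below 1 for this sparse example
gaps = [sufficient_budget(p, mu, eps, j) - required_budget(s1, z1, mu, eps, j)
        for j in range(1, k + 1)]
print(c, min(gaps))
```

Whenever $c \le 1$ and $z_1 \le p$, the simplified budget dominates the exact requirement at every step, so meeting the simplified condition is enough to apply the lemma.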



C. Proof of Theorem III.1

Throughout the proof, whenever asymptotic notation or limits are used it is always under the assumption that
$p \to \infty$, and we use the standard notation $f(p) = o(g(p))$ to indicate that $\lim_{p \to \infty} f(p)/g(p) = 0$, for $f(p) \ge 0$
and $g(p) > 0$. Also the quantities $k := k(p)$, $\epsilon := \epsilon(p)$ and $\mu := \mu(p)$ are functions of $p$, but we do not denote this
explicitly for ease of notation. We let $\epsilon := p^{-1/3}$ throughout the proof.

We begin by proving part (ii) of the theorem, which is concerned with detecting the presence or absence of a
sparse signal. Part (i), which pertains to identifying the locations of the non-zero components, then follows with a
slight modification.

Case 1 - Signal absent ($S = \emptyset$): This is the simplest scenario, but through its analysis we will develop tools
that will be useful when analyzing the case where the signal is present. Here, we have $s_1 = 0$ and $z_1 = p$, and the
number of indices retained at the end of the DS procedure $|I_k|$ is equal to $z_k$. Define the event

$$\Gamma := \left\{\left(\tfrac{1}{2} - \epsilon\right)^{k-1} p \le |I_k| \le \left(\tfrac{1}{2} + \epsilon\right)^{k-1} p\right\}.$$

The second part of Lemma IV.3 characterizes the probability of this event; in particular

$$\Pr(\Gamma) \ge 1 - 2\sum_{j=1}^{k-1} \exp\left(-2p\left(\tfrac{1}{2} - \epsilon\right)^{j-1}\epsilon^2\right).$$

Since $k \le \log_2 \log p + 3$, for large enough $p$ we get that

$$\Pr(\Gamma) \ge 1 - 2(k - 1)\exp\left(-2p\left(\tfrac{1}{2} - \epsilon\right)^{k-2}\epsilon^2\right)$$
$$= 1 - 2(k - 1)\exp\left(-p\left(\tfrac{1}{2}\right)^{k-3}(1 - 2\epsilon)^{k-2}\epsilon^2\right)$$
$$\ge 1 - 2(\log_2 \log p + 2)\exp\left(-\frac{p^{1/3}}{\log p}(1 - o(1))\right),$$

where we used Lemma B.1 to conclude that $(1 - 2\epsilon)^{k-2} = 1 - o(1)$. It is clear that $\Pr(\Gamma) \to 1$.

In this case we assume that $S = \emptyset$, therefore the output of the DS procedure consists of $|I_k|$ i.i.d. Gaussian
random variables with zero mean and variance $|I_k|/R_k = |I_k|/(c_k p)$. Note that given $\Gamma$,

$$\frac{|I_k|}{p} \le \left(\tfrac{1}{2} + \epsilon\right)^{k-1} \le \frac{1}{2\log p}(1 + o(1)),$$

which follows from the fact that $k \ge \log_2 \log p + 2$, and using Lemma B.1. With this in hand we conclude that
(with a slight abuse of notation)

$$\Pr(\hat{S}_{DS} \neq \emptyset \mid \Gamma) = \Pr\left(\exists\, i \in I_k : y_{i,k} > \sqrt{2/c_k}\right)$$
$$\le |I_k| \Pr\left(\mathcal{N}(0, |I_k|/(c_k p)) > \sqrt{2/c_k}\right)$$
$$= |I_k| \Pr\left(\mathcal{N}(0, 1) > \sqrt{2p/|I_k|}\right)$$
$$\le p \Pr\left(\mathcal{N}(0, 1) > \sqrt{4\log p\,(1 - o(1))}\right)$$
$$\le p \exp\left(-2\log p\,(1 - o(1))\right),$$

where the last inequality follows from the standard Gaussian tail bound. This together with $\Pr(\Gamma) \to 1$ immediately
shows that when $S = \emptyset$ we have $\Pr(\hat{S}_{DS} \neq \emptyset) \to 0$.

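Case 2 - Signal present ($S \neq \emptyset$): The proof follows the same idea as in the previous case, although the argument
is a little more involved. Begin by applying Lemma IV.3 and constructing an event that occurs with probability
tending to one. Let $\Gamma$ be the event

$$\Gamma := \left\{ s_1 \prod_{j=1}^{k-1}(1 - \epsilon_j) \le s_k \le s_1 \right\} \cap \left\{ \left(\tfrac{1}{2} - \epsilon\right)^{k-1} z_1 \le z_k \le \left(\tfrac{1}{2} + \epsilon\right)^{k-1} z_1 \right\},$$

where $\epsilon_j$ is given by equation (5). Lemma IV.3 characterizes the probability of this event under a condition on $R_j$
that we will now verify. Note that this condition is equivalent to $\epsilon_j^2 \le 1/(8\pi)$ for all $j = 1, \dots, k - 1$. Instead of
showing exactly this we will show a stronger result that will be quite useful in a later stage of the proof. Recall that
$R_{j+1}/R_j \ge \delta > 1/2$, $j = 1, \dots, k - 2$, and $R_1 = c_1 p$ by the assumptions of the theorem. Thus for $j = 1, \dots, k - 1$

$$\epsilon_j^2 \le \frac{s_1 + (1/2 + \epsilon)^{j-1} z_1}{2\pi\mu^2\delta^{j-1} R_1} \le \frac{1}{2\pi\mu^2 c_1}\left[\frac{s_1}{p}\,\delta^{-(j-1)} + \left(\frac{1/2 + \epsilon}{\delta}\right)^{j-1}\right].$$

Clearly we have that $\epsilon_1^2 \le \frac{1}{2\pi\mu^2 c_1} \le 1/(8\pi)$ since by assumption $\mu \ge \sqrt{4/c_1}$. Now consider the case $j > 1$. Recall
that $k \le \log_2 \log p + 3$. Therefore if $\delta \ge 1$, then the term $\delta^{-(j-1)}$ can be upper bounded by 1, otherwise

$$\delta^{-(j-1)} \le \delta^{-(k-2)} \le \delta^{-(\log_2 \log p + 1)} = \delta^{-1}(\log p)^{-\log_2 \delta} < 2\log p, \qquad (7)$$

where the last step follows from $\delta > 1/2$.
Now recall that $s_1 = p^{1-\beta}$, therefore

$$\epsilon_j^2 \le \frac{1}{2\pi\mu^2 c_1}\left[2p^{-\beta}\log p + \left(\frac{1/2 + \epsilon}{\delta}\right)^{j-1}\right].$$

Note that, since $\epsilon \to 0$ as $p \to \infty$ we have that, for $p$ large enough, $\delta/(1/2 + \epsilon) \ge (\delta + 1/2 + \epsilon)$. Assume $p$ is
large enough so that this is true, then

$$\epsilon_j^2 \le \frac{1}{2\pi\mu^2 c_1}\left[2p^{-\beta}\log p + \left(\delta + \tfrac{1}{2} + \epsilon\right)^{-(j-1)}\right]. \qquad (8)$$

Clearly since $j \le k - 1 \le \log_2 \log p + 2$ we have that $\left(\delta + \tfrac{1}{2} + \epsilon\right)^{-(j-1)} = \Omega\left(1/(\log p)^{\log_2(\delta + 1/2 + \epsilon)}\right)$ and so
the first of the additive terms in (8) is negligible for large $p$. Therefore for $p$ sufficiently large, we have, for all
$j = 1, \dots, k - 1$

$$\epsilon_j^2 \le \frac{1}{2\pi\mu^2 c_1}\left(\delta + \frac{1}{2}\right)^{-(j-1)}. \qquad (9)$$

Since by assumption $\mu \ge \sqrt{4/c_1}$, we conclude that, for all $p$ sufficiently large, $\epsilon_j^2 \le 1/(8\pi)$ for all $j = 1, \dots, k - 1$,
and so $R_j \ge \frac{4}{\mu^2}\left(s_1 + (1/2 + \epsilon)^{j-1} z_1\right)$ for $j = 1, \dots, k - 1$. Thus, applying Lemma IV.3 we have

$$\Pr(\Gamma) \ge 1 - \sum_{j=1}^{k-1} \exp\left(-\frac{s_1 \prod_{\ell=1}^{j-1}(1 - \epsilon_\ell)}{\sqrt{8\pi}}\right) - 2\sum_{j=1}^{k-1} \exp\left(-2 z_1 (1/2 - \epsilon)^{j-1}\epsilon^2\right).$$

By a similar argument to that used in Case 1, it is straightforward to show that

$$2\sum_{j=1}^{k-1} \exp\left(-2 z_1 (1/2 - \epsilon)^{j-1}\epsilon^2\right) \to 0.$$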

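In addition,

$$\sum_{j=1}^{k-1} \exp\left(-\frac{s_1 \prod_{\ell=1}^{j-1}(1 - \epsilon_\ell)}{\sqrt{8\pi}}\right) \le (k - 1)\exp\left(-\frac{s_1}{\sqrt{8\pi}}\prod_{\ell=1}^{k-2}(1 - \epsilon_\ell)\right)$$
$$\le (k - 1)\exp\left(-\frac{s_1}{\sqrt{8\pi}}\prod_{\ell=1}^{k-2}\left(1 - \frac{1}{\sqrt{2\pi\mu^2 c_1}}\left(\delta + \tfrac{1}{2}\right)^{-(\ell-1)/2}\right)\right)$$
$$\le (k - 1)\exp\left(-\frac{s_1}{\sqrt{8\pi}}\prod_{\ell=1}^{k-2}\left(1 - \frac{1}{\sqrt{8\pi}}\left(\delta + \tfrac{1}{2}\right)^{-(\ell-1)/2}\right)\right),$$

where in the last step we used the fact that $\mu \ge \sqrt{4/c_1}$. Finally note that from Lemma B.2 we know that

$$\prod_{\ell=1}^{k-2}\left(1 - \frac{1}{\sqrt{8\pi}}\left(\delta + \tfrac{1}{2}\right)^{-(\ell-1)/2}\right) \to L(\delta),$$

where $L(\delta) > 0$, hence

$$\sum_{j=1}^{k-1} \exp\left(-\frac{s_1 \prod_{\ell=1}^{j-1}(1 - \epsilon_\ell)}{\sqrt{8\pi}}\right) \le (\log_2 \log p + 2)\exp\left(-\frac{p^{1-\beta}}{\sqrt{8\pi}}\left(L(\delta) + o(1)\right)\right) \to 0. \qquad (10)$$

Therefore we conclude that the event $\Gamma$ happens with probability converging to one.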

We now proceed as before, by conditioning on event $\Gamma$. The output of the DS procedure consists of a total of
$|I_k| = s_k + z_k$ independent Gaussian measurements with variance $|I_k|/R_k$, where $s_k$ of them have mean $\mu$ and the
remaining $z_k$ have mean zero. We will show that the proposed thresholding procedure identifies only true non-zero
components (i.e., correctly rejects all the zero-valued components). In other words, with probability tending to
one, $\hat{S}_{DS} = S \cap I_k$. For ease of notation, and without loss of generality, assume that $y_{i,k} \sim \mathcal{N}(\mu, |I_k|/R_k)$ for
$i \in \{1, \dots, s_k\}$ and $y_{i,k} \sim \mathcal{N}(0, |I_k|/R_k)$ for $i \in \{s_k + 1, \dots, |I_k|\}$. Then

$$\Pr\left(\hat{S}_{DS} \neq S \cap I_k \mid \Gamma\right) = \Pr\left(\bigcup_{i=1}^{s_k}\left\{y_{i,k} < \sqrt{2/c_k}\right\} \text{ or } \bigcup_{i=s_k+1}^{|I_k|}\left\{y_{i,k} > \sqrt{2/c_k}\right\} \,\Big|\, \Gamma\right)$$
$$\le s_k \Pr\left(\mathcal{N}(\mu, |I_k|/R_k) < \sqrt{2/c_k}\right) + z_k \Pr\left(\mathcal{N}(0, |I_k|/R_k) > \sqrt{2/c_k}\right).$$

Note that conditioned on the event $\Gamma$ (using arguments similar to those in Case 1)

$$|I_k| = s_k + z_k \le s_1 + z_1\left(\tfrac{1}{2} + \epsilon\right)^{k-1} \le p^{1-\beta} + \frac{p}{2\log p}(1 + o(1)) \le \frac{p}{2\log p}(1 + o(1)). \qquad (11)$$

Finally, taking into account that $\mu \ge 2\sqrt{2/c_k}$ we conclude that

$$\Pr\left(\hat{S}_{DS} \neq S \cap I_k \mid \Gamma\right) \le s_k \Pr\left(\mathcal{N}(0, |I_k|/R_k) < -\sqrt{2/c_k}\right) + z_k \Pr\left(\mathcal{N}(0, |I_k|/R_k) > \sqrt{2/c_k}\right)$$
$$\le s_k \Pr\left(\mathcal{N}(0, 1) > \sqrt{2p/|I_k|}\right) + z_k \Pr\left(\mathcal{N}(0, 1) > \sqrt{2p/|I_k|}\right)$$
$$= |I_k| \Pr\left(\mathcal{N}(0, 1) > \sqrt{2p/|I_k|}\right)$$
$$\le p \Pr\left(\mathcal{N}(0, 1) > \sqrt{4\log p\,(1 - o(1))}\right)$$
$$\le p \exp\left(-2\log p\,(1 - o(1))\right),$$

where the last inequality follows from the standard Gaussian tail bound. This together with $\Pr(\Gamma) \to 1$, and the
fact that $|S \cap I_k| = s_k \ge L(\delta)(1 - o(1)) s_1$ is bounded away from zero for large enough $p$ immediately shows that
$\Pr(\hat{S}_{DS} = \emptyset) \to 0$, concluding the proof of part (ii) of the theorem.

Part (i) of the theorem follows from the result proved above, since if $\mu$ is any positive diverging sequence in $p$
then a stronger version of Lemma B.2 applies. In particular, recall (9), and note that Lemma B.2 implies

$$\prod_{\ell=1}^{k-1}(1 - \epsilon_\ell) \to 1.$$

We have already established that the events $\Gamma$ and $\{\hat{S}_{DS} = S \cap I_k\}$ both hold (simultaneously) with probability
tending to one. Conditionally on these events we have

$$\mathrm{FDP}(\hat{S}_{DS}) = \frac{0}{s_k} = 0,$$

and

$$\mathrm{NDP}(\hat{S}_{DS}) = \frac{s_1 - s_k}{s_1} = 1 - \frac{s_k}{s_1} \to 0,$$

since from the definition of $\Gamma$ we have

$$s_1 \ge s_k \ge s_1 \prod_{\ell=1}^{k-1}(1 - \epsilon_\ell) \to s_1.$$

Therefore we conclude that both $\mathrm{FDP}(\hat{S}_{DS})$ and $\mathrm{NDP}(\hat{S}_{DS})$ converge in probability to zero as $p \to \infty$, concluding
the proof of the theorem.

V. Numerical Experiments

This section presents numerical experiments with Distilled Sensing (DS). The results demonstrate that the 
asymptotic analysis predicts the performance in finite dimensional cases quite well. Furthermore, the experiments 
suggest useful rules of thumb for implementing DS in practice. 

There are two input parameters to the DS procedure: the number of distillation steps, $k$, and the distribution
of precision across the steps, $\{R_j\}_{j=1}^{k}$. Throughout our simulations we choose $k = \max\{\lceil \log_2 \log p \rceil, 0\} + 2$,
as prescribed in Theorem III.1. For the precision distribution, first recall the discussion following the proof of
Lemma IV.3. There it is argued that if the sparsity model is valid, a sufficient condition for the precision distribution
is $R_j \ge R_1 (1/2 + \epsilon)^{j-1}$, $j = 1, \dots, k$, with $0 < \epsilon < 1/2$. In words, the precision allocated to each step must be
greater than 1/2 the precision allocated in the previous step. In practice, we find that choosing $R_{j+1}/R_j = 0.75$
for $j = 1, \dots, k - 2$ provides good performance over the full SNR range of interest. Also, from the proof of the
main result (Theorem III.1) we see that the threshold for detection is inversely proportional to the square root of
the precision allocation in the first and last steps. Thus, we have found that allocating equal precision in the first
and last steps is beneficial. The intuition is that the first step is the most crucial in controlling the NDP and the final
step is most crucial in controlling the FDP. Thus, the precision allocation used throughout the simulations follows
this simple formula:

$$R_j = (0.75)^{j-1} R_1, \quad j = 2, \dots, k - 1, \qquad R_k = R_1,$$

and $R_1$ is chosen so that $\sum_{j=1}^{k} R_j = p$.
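This allocation rule is easy to compute. The sketch below assumes the natural logarithm inside the step-count formula and sets the last-step budget equal to the first, following the discussion above; the function names are hypothetical.

```python
import math


def num_steps(p):
    """Number of distillation steps: max(ceil(log2(log p)), 0) + 2."""
    return max(math.ceil(math.log2(math.log(p))), 0) + 2


def precision_allocation(p, ratio=0.75):
    """Budgets R_1..R_k with geometric decay, R_k = R_1, and total precision p."""
    if ratio <= 0.5:
        raise ValueError("the ratio between consecutive steps should exceed 1/2")
    k = num_steps(p)
    weights = [ratio ** j for j in range(k - 1)] + [1.0]  # R_1..R_{k-1}, then R_k = R_1
    r1 = p / sum(weights)
    return [r1 * w for w in weights]


p = 2 ** 14
budgets = precision_allocation(p)
c1 = budgets[0] / p
print(num_steps(p), [round(b, 1) for b in budgets], 2.0 / c1)
```

For $p = 2^{14}$ this gives six steps and a first-step fraction $c_1$ for which $2/c_1$ is close to 8, which is the SNR level singled out in the experiments that follow.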

Figure [T] compares the FDP vs. NDP performance of the DS procedure to non-adaptive (single observation) 
measurement at several signal-to-noise ratios (SNR = /i^). We consider signals of length p = 2^^ having ^/p = 128 
non-zero components with uniform amplitude with locations chosen uniformly at random. This choice of signal 
dimension corresponds to fc = 6 observation steps in the DS procedure. The range of FDP-NDP operating points is 
surveyed by varying the threshold applied to the non-adaptive measurements and the output of the DS procedure for 
each of 1000 trials, corresponding to different realizations of randomly-generated signal and additive noise. Recall 
that largest squared magnitude in a realization of p i.i.d. M{Q, 1) variables grows like 2 logp, and in our experiment, 
21ogp « 20. Consequently, when the SNR = 20 we see that both DS and non-adaptive measurements are highly 
successful, as expected. Another SNR level of interest is 8, since in this case this happens to approximately satisfy 
the condition ji = y/2/ci = ^IpjRi, which according to the Theorem IIII.ll is a critical level for detection using 
DS. The simulations show that DS remains highly successful at this level while the non-adaptive results are poor 
Finally, when the SNR = 2, we see that DS still yields useful results. For example, at FDP = 0.05, the DS 
procedure has an average NDP of roughly 80% (i.e., 20% of the true components are still detected, on average). 
This demonstrates the approximate logp extension of the SNR range provided by DS. Note the gap in the FDP 
values of the DS results (roughly from 0.75 to 1). The gap arises because the output of DS has a higher SNR 
and is much less sparse than the original signal, and so arbitrarily large FDP values cannot be achieved by any 
choice of threshold. Large FDP values are, of course, of little interest in practice. We also remark on the structured 
patterns observed in cases of high NDP and low FDP (in the upper left of the figures for SNR = 2 and SNR = 8). The 
visually structured 'curves' of NDP-FDP pairs arise when the total number of discoveries is small, and hence the 
FDP values are restricted to certain rational numbers. For example, if just 3 components are discovered, then the 
FDP can only take the values 0, 1/3, 2/3, and 1. 

Fig. 1. FDP and NDP performance for DS (indicated with *) and non-adaptive sensing (indicated with •) at different SNRs. Smaller values 
of FDP and NDP correspond to more accurate recovery (i.e., exact support recovery occurs when NDP = FDP = 0). The results clearly show 
that DS outperforms non-adaptive sensing for each SNR examined. 
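
The restriction of FDP values to a few rational numbers when there are few discoveries can be enumerated directly. The following is a minimal sketch; the function name `possible_fdp_values` and the convention that zero discoveries gives FDP = 0 are assumptions made for illustration only.

```python
from fractions import Fraction


def possible_fdp_values(num_discoveries):
    """All FDP values attainable when exactly `num_discoveries` components are declared."""
    if num_discoveries == 0:
        # Assumed convention: with no discoveries there are no false discoveries.
        return [Fraction(0)]
    # The number of false discoveries can be 0, 1, ..., num_discoveries.
    return [Fraction(false, num_discoveries) for false in range(num_discoveries + 1)]


for d in (1, 2, 3, 4):
    print(d, [str(v) for v in possible_fdp_values(d)])
```

With 3 discoveries the list contains exactly the four values 0, 1/3, 2/3, and 1, which is why the NDP-FDP pairs line up on a small number of vertical positions when few components are declared.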

Figure 2 compares the performance of non-adaptive sensing and the DS procedure in terms of the false-discovery 
rate (FDR) and the non-discovery rate (NDR), which are the average FDP and NDP, respectively. We consider three 
different cases, corresponding to signals of length p = 2^^, 2^*^, and 2^°, (the solid, dashed, and dash-dot lines, 
respectively) where for each case the number of non-zero signal components is $\lfloor p^{1/2} \rfloor$. The precision allocation and 
number of observation steps are chosen as described above (here, $k = 6$ for each of the three cases). For each value 
of SNR, 500 independent experiments were performed for DS and non-adaptive sampling, and in each, thresholds 
were selected so that the FDRs were fixed at approximately 0.05. The resulting average FDRs and NDRs for each 
SNR level are shown. The results show that not only does DS achieve significantly lower NDRs than non-adaptive 
sampling over the entire SNR range, but its performance also exhibits much less dependence on the signal dimension 
$p$. 

Fig. 2. FDR and NDR vs. SNR comparison. The solid, dashed, and dash-dot lines correspond to signals of length p = 2^**, 2^^, and 2"^^ , 
respectively, having $\lfloor p^{1/2} \rfloor$ non-zero entries. At each value of SNR and for each method (DS and non-adaptive sampling), thresholds were 
selected to achieve FDR = 0.05. Lower values of NDR correspond to more accurate recovery; DS clearly outperforms non-adaptive sensing 
over the entire SNR range and shows much less dependence on the signal dimension $p$. 
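
For readers who want to experiment with a comparison of this kind, the following is a minimal sketch of a single DS trial, not the code used for the reported experiments. Several details are assumptions made for illustration: the refinement rule (retain an index when its observation is positive), the geometric precision allocation with ratio 0.75 and a final step equal to the first, the identification of SNR with the squared amplitude, and the hypothetical helper names `geometric_budgets` and `distilled_sensing`.

```python
import math
import random


def geometric_budgets(total, steps, ratio=0.75):
    """Assumed allocation: geometric decay over steps, last step equal to the first."""
    weights = [ratio ** j for j in range(steps - 1)] + [1.0]
    scale = total / sum(weights)
    return [w * scale for w in weights]


def distilled_sensing(x, budgets, rng):
    """Return {index: final observation} for the indices that survive refinement."""
    retained = list(range(len(x)))
    observed = {}
    for step, budget in enumerate(budgets):
        if not retained:
            return {}
        # Spread this step's precision evenly over the retained indices.
        precision = budget / len(retained)
        observed = {
            i: math.sqrt(precision) * x[i] + rng.gauss(0.0, 1.0) for i in retained
        }
        if step < len(budgets) - 1:
            # Distillation (assumed rule): keep only positive observations.
            retained = [i for i in retained if observed[i] > 0.0]
    return observed


rng = random.Random(0)
p, n_signal, snr = 2 ** 12, 64, 16.0
# Signal on the first n_signal indices; the location is irrelevant to the procedure.
x = [math.sqrt(snr)] * n_signal + [0.0] * (p - n_signal)
budgets = geometric_budgets(total=p, steps=6)
final = distilled_sensing(x, budgets, rng)
kept_signal = sum(1 for i in final if i < n_signal)
print(len(final), kept_signal)
```

Each distillation step discards roughly half of the remaining zero components while retaining most signal components, so the final observations are made on a much smaller and less sparse index set. Thresholding the values in `final` and sweeping the threshold would then trace out FDP-NDP pairs of the kind discussed above.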

VI. Concluding Remarks 

There has been tremendous interest in high-dimensional testing and detection problems in recent years. A 
well-developed theory exists for such problems when using a single, non-adaptive observation model [1], [2], 
[4]-[6]. However, in practice and theory, multistage adaptive designs have shown promise [7]-[10]. This paper 
quantifies the improvements such methods can achieve. We proposed and analyzed a specific multistage design 
called Distilled Sensing (DS), and established that DS is capable of detecting and localizing much weaker sparse 
signals than non-adaptive methods. The main result shows that adaptivity allows reliable detection and localization 
at a signal-to-noise ratio (SNR) that is roughly $\log p$ lower than the minimum required by non-adaptive methods, 
where $p$ is the problem dimension. To put this in context, suppose one is interested in screening $p = 20{,}000$ 
genes; then $\log p \approx 10$. Thus, the gains can be quite significant in problem sizes of practical interest, which is why 
experimentalists often do employ similar methods. 
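
The value of $\log p$ quoted for the gene-screening example is a natural logarithm and is easy to check. The snippet below is a small sketch; the additional problem sizes are included only for comparison and are not taken from the text.

```python
import math

# Natural logarithm of the problem dimension for the gene-screening example.
p = 20_000
log_p = math.log(p)
print(round(log_p, 2))

# A few other dimensions, chosen here only to show how slowly log p grows.
for dim in (10 ** 3, 10 ** 6, 10 ** 9):
    print(dim, round(math.log(dim), 2))
```

Because $\log p$ grows so slowly, the factor is of order 10 across a very wide range of practical problem sizes.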

An additional point worthy of future investigation is the development of lower bounds, characterizing the minimum 
amplitude $\mu(p)$ below which signal detection and localization are impossible for any sensing procedure (including 
adaptive sensing). In general, lower bounds are difficult to devise for sequential experimental design settings, with 
a few notable exceptions [19], [20]. Here, our results establish that significant improvements are achievable using 
adaptivity, although we relegate any general claims of optimality for adaptive sensing procedures to future work. 

There are several possible extensions to DS. One is to consider even sparser signal models, where the number 
of nonzero entries is significantly smaller than $p^{1-\beta}$ for $\beta \in (0,1)$, as considered here. In particular, the same 
asymptotic results stated here follow also for signals whose sparsity levels are as small as a constant times 
$\log \log \log p$. Indeed, making this choice of $s_1$ in (8) leads to the same bound on the $\epsilon_j$ given in (9), and this 
choice is also sufficient to ensure that (10) holds as well. In addition, for this choice of $s_1$ the same bound is 
obtained in (11), and the rest of the proof goes through as stated. Another extension is to use DS with alternate 
measurement models. For example, each measurement could be a linear combination of the entries of $x$, rather 
than direct measurements of individual components. If the linear combinations are non-adaptive, this leads to a 
regression model commonly studied in the Lasso and Compressed Sensing literature; see, for example, [21], [22]. 
However, sequentially tuning the linear combinations leads to an adaptive version of the regression model which 
can be shown to provide significant improvements as well [23]. 

Appendix A 
Thresholds for Non-Adaptive Recovery 

In this section we give a proof of Theorem II.3. We will proceed by considering two cases separately: (i) $r > \beta$ 
and (ii) $r < \beta$. The analysis of the phase transition point $r = \beta$ is interesting, but it is beyond the scope of this 
paper. Begin by noticing that in the setting of the theorem the minimax optimal support estimation procedure to 
control the false and non-discovery proportions is a simple coordinate-wise thresholding procedure of the form 

$$\hat{S} = \{ i : y_i > \tau \} ,$$

where $\tau > 0$ can be chosen appropriately. A formal proof of this optimality can be done by noting that the class 
of hypotheses is invariant under permutations (see ID, IS) for details). 

Case (i) $r > \beta$: In this case the signal support can be accurately identified from the observations, in the sense 
that $\mathrm{FDP}(\hat{S})$ and $\mathrm{NDP}(\hat{S})$ both converge in probability to zero. For this case we will take $\tau = \tau(p) = \sqrt{2\alpha \log p}$, 
where $\beta < \alpha < r$. 

Begin by defining $D_z$ and $M_s$ to be the number of retained non-signal components and the number of missed 
signal components, respectively. Formally 

$$D_z = \sum_{i=1}^{p} \mathbf{1}_{\{y_i > \tau,\ x_i = 0\}}$$

and 

$$M_s = \sum_{i=1}^{p} \mathbf{1}_{\{y_i \le \tau,\ x_i \ne 0\}} .$$

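Note that $D_z$ is binomially distributed, that is $D_z \sim \mathrm{Bin}(p(1 - p^{-\beta}), q_z)$, where $q_z = \Pr(y_i > \tau)$ when $i$ is such 
that $x_i = 0$. By noticing that $\tau > 0$ and using a standard Gaussian tail bound we have that 

$$q_z \le \frac{1}{\sqrt{2\pi}\,\tau} \exp\left(-\frac{\tau^2}{2}\right) = \frac{1}{\sqrt{4\pi\alpha\log p}}\; p^{-\alpha} .$$

In a similar fashion note that $M_s \sim \mathrm{Bin}(p^{1-\beta}, q_s)$, where $q_s = \Pr(y_i \le \tau)$ when $i$ is such that $x_i = \sqrt{2r \log p}$. 
Let $Z \sim \mathcal{N}(0, 1)$ be an auxiliary random variable. Then 

$$\begin{aligned}
q_s &= \Pr(Z + \sqrt{2r\log p} \le \tau) \\
&= \Pr(Z \le \tau - \sqrt{2r\log p}) \\
&= \Pr\left(Z > \sqrt{2\log p}\,(\sqrt{r} - \sqrt{\alpha})\right) .
\end{aligned}$$

And so, using the Gaussian tail bound we have 

$$q_s \le \frac{1}{\sqrt{4\pi\log p}\,(\sqrt{r} - \sqrt{\alpha})}\; p^{-(\sqrt{r} - \sqrt{\alpha})^2} .$$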
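
The Gaussian tail bound used here, $\Pr(Z > t) \le e^{-t^2/2} / (t\sqrt{2\pi})$ for $t > 0$, together with the matching lower bound that carries an extra factor $(1 - 1/t^2)$, can be checked numerically against the exact tail. This is a sketch for illustration; the function names are hypothetical, and the exact tail is computed from the complementary error function in the Python standard library.

```python
import math


def gaussian_tail(t):
    """Exact Pr(Z > t) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def tail_upper(t):
    """Upper bound exp(-t^2 / 2) / (t * sqrt(2 * pi)), valid for t > 0."""
    return math.exp(-t * t / 2.0) / (t * math.sqrt(2.0 * math.pi))


def tail_lower(t):
    """Lower bound (1 - 1 / t^2) * tail_upper(t), valid for t > 0."""
    return (1.0 - 1.0 / (t * t)) * tail_upper(t)


def threshold_bound(p, alpha):
    """tail_upper evaluated at tau = sqrt(2 * alpha * log p), in closed form."""
    return p ** (-alpha) / math.sqrt(4.0 * math.pi * alpha * math.log(p))


for t in (1.0, 2.0, 3.0, 5.0):
    print(t, tail_lower(t), gaussian_tail(t), tail_upper(t))
```

Substituting $t = \sqrt{2\alpha\log p}$ into the upper bound gives $p^{-\alpha} / \sqrt{4\pi\alpha\log p}$, which is what `threshold_bound` computes.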



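We are ready to show that both $\mathrm{FDP}(\hat{S})$ and $\mathrm{NDP}(\hat{S})$ converge in probability to zero. Begin by noticing that 
$\mathrm{NDP}(\hat{S}) = M_s / p^{1-\beta}$. By definition $\mathrm{NDP}(\hat{S}) \xrightarrow{P} 0$ means that for all fixed $\epsilon > 0$, 

$$\Pr(|\mathrm{NDP}(\hat{S})| > \epsilon) \to 0 ,$$

as $p \to \infty$. Noting that $\mathrm{NDP}(\hat{S})$ is non-negative, this can be easily established using Markov's inequality. 

$$\begin{aligned}
\Pr(\mathrm{NDP}(\hat{S}) > \epsilon) &= \Pr\left( \frac{M_s}{p^{1-\beta}} > \epsilon \right) \\
&= \Pr(M_s > \epsilon p^{1-\beta}) \\
&\le \frac{E[M_s]}{\epsilon p^{1-\beta}} = \frac{q_s}{\epsilon} \to 0 ,
\end{aligned}$$

as $p \to \infty$, as clearly $q_s$ converges to zero (since $r > \alpha$). For the false discovery proportion the reasoning is similar. 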
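Note that the number of correct discoveries is $p^{1-\beta} - M_s$. Taking this into account we have 

$$\mathrm{FDP}(\hat{S}) = \frac{D_z}{p^{1-\beta} - M_s + D_z} .$$

Let $\epsilon > 0$. Then 

$$\begin{aligned}
\Pr(\mathrm{FDP}(\hat{S}) > \epsilon) &= \Pr\left( \frac{D_z}{p^{1-\beta} - M_s + D_z} > \epsilon \right) \\
&= \Pr\left( (1-\epsilon) \frac{D_z}{p^{1-\beta}} + \epsilon \frac{M_s}{p^{1-\beta}} > \epsilon \right) \\
&\le \frac{(1-\epsilon)\, p(1 - p^{-\beta})\, q_z + \epsilon\, p^{1-\beta} q_s}{\epsilon\, p^{1-\beta}} \\
&= \frac{1-\epsilon}{\epsilon}\; p^{\beta}(1 - p^{-\beta})\, q_z + q_s \\
&\le \frac{1-\epsilon}{\epsilon} \cdot \frac{p^{\beta - \alpha}}{\sqrt{4\pi\alpha\log p}} + \frac{p^{-(\sqrt{r} - \sqrt{\alpha})^2}}{\sqrt{4\pi\log p}\,(\sqrt{r} - \sqrt{\alpha})} ,
\end{aligned}$$

where the last line clearly converges to zero as $p \to \infty$, since $\beta < \alpha < r$. Therefore we conclude that $\mathrm{FDP}(\hat{S})$ 
converges to zero in probability. 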
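
The behavior established in Case (i) can be illustrated with a small Monte Carlo sketch of coordinate-wise thresholding. The parameter values ($p = 2^{16}$, $\beta = 1/2$, $r = 2$, $\alpha = 1$), the function name `threshold_experiment`, and the convention FDP = 0 when nothing is discovered are assumptions chosen for illustration; they are not taken from the experiments reported in the paper.

```python
import math
import random


def threshold_experiment(p, beta, r, alpha, seed=0):
    """One trial of thresholding at tau = sqrt(2 alpha log p); returns (FDP, NDP)."""
    rng = random.Random(seed)
    n_signal = int(p ** (1.0 - beta))
    mu = math.sqrt(2.0 * r * math.log(p))       # signal amplitude
    tau = math.sqrt(2.0 * alpha * math.log(p))  # threshold, with beta < alpha < r
    # D_z: zero components whose observation exceeds the threshold.
    false_disc = sum(1 for _ in range(p - n_signal) if rng.gauss(0.0, 1.0) > tau)
    # M_s: signal components whose observation does not exceed the threshold.
    missed = sum(1 for _ in range(n_signal) if mu + rng.gauss(0.0, 1.0) <= tau)
    discoveries = false_disc + (n_signal - missed)
    fdp = false_disc / discoveries if discoveries else 0.0
    ndp = missed / n_signal
    return fdp, ndp


fdp, ndp = threshold_experiment(p=2 ** 16, beta=0.5, r=2.0, alpha=1.0)
print(fdp, ndp)
```

Since the procedure is invariant under permutations of the coordinates, the sketch does not track where the signal components are located; it only counts false discoveries and missed components.
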
Case (ii) $r < \beta$: In this case we will show that no thresholding procedure can simultaneously control the false 
and non-discovery proportions. Begin by noting that the smaller $\tau$ is, the easier it is to control the non-discovery 
proportion. In what follows we will identify an upper-bound on $\tau$ necessary for the control of the non-discovery 
rate. Note that if $\tau = \tau(p) = \sqrt{2r \log p}$ then $q_s = 1/2$, and therefore 

$$\mathrm{NDP}(\hat{S}) = \frac{M_s}{p^{1-\beta}} \xrightarrow{P} 1/2 ,$$

as $p \to \infty$, by the law of large numbers. Therefore a necessary, albeit insufficient, condition for $\mathrm{NDP}(\hat{S}) \xrightarrow{P} 0$ is 
that for all but finitely many $p$ 

$$\tau < \sqrt{2r\log p} . \qquad (12)$$

Similarly, note that the larger $\tau$ is, the easier it is to control the false discovery rate. In the same spirit of the above 
derivation we will identify a lower-bound for $\tau$ that must necessarily hold in order to control the false-discovery 
rate. Recall the previous derivation, where we showed that, for any $\epsilon > 0$, 

$$\begin{aligned}
\Pr(\mathrm{FDP}(\hat{S}) > \epsilon) &= \Pr\left( (1-\epsilon) \frac{D_z}{p^{1-\beta}} + \epsilon \frac{M_s}{p^{1-\beta}} > \epsilon \right) \\
&\ge \Pr\left( \frac{D_z}{p^{1-\beta}} > \frac{\epsilon}{1-\epsilon} \right) ,
\end{aligned}$$

where the last inequality follows trivially given that $M_s \ge 0$ and, without loss of generality, we assume that $\epsilon < 1$. 
This means that $\mathrm{FDP}(\hat{S})$ converges in probability to zero only if $D_z / p^{1-\beta}$ also converges in probability to zero. Namely, 
for any $\epsilon > 0$ we must have $\lim_{p \to \infty} \Pr(D_z / p^{1-\beta} > \epsilon) = 0$. In what follows take $\tau = \sqrt{2r \log p}$. Let $\epsilon > 0$ and 
note that 

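$$\begin{aligned}
\Pr\left( \frac{D_z}{p^{1-\beta}} > \epsilon \right) &= \Pr(D_z > \epsilon p^{1-\beta}) \\
&= \Pr(D_z - E[D_z] > \epsilon p^{1-\beta} - E[D_z]) \\
&= \Pr(D_z - E[D_z] > \epsilon p^{1-\beta} - p(1 - p^{-\beta}) q_z) .
\end{aligned}$$

Define $a = \epsilon p^{1-\beta} - p(1 - p^{-\beta}) q_z$. Note that by the Gaussian tail bound, we have 

$$\frac{1}{\sqrt{4\pi r \log p}} \left( 1 - \frac{1}{2r\log p} \right) p^{-r} \le q_z \le \frac{1}{\sqrt{4\pi r \log p}}\; p^{-r} ,$$

or equivalently, 

$$q_z = (1 - o(1)) \frac{p^{-r}}{\sqrt{4\pi r \log p}} .$$

Given this it is straightforward to see that 

$$\begin{aligned}
a &= \epsilon p^{1-\beta} - (1 - o(1)) \frac{p^{1-r}}{\sqrt{4\pi r \log p}} \\
&= p^{1-r} \left( \epsilon p^{r-\beta} - \frac{1 - o(1)}{\sqrt{4\pi r \log p}} \right) \\
&= -(1 - o(1)) \frac{p^{1-r}}{\sqrt{4\pi r \log p}} ,
\end{aligned}$$

where in the last step we use the assumption that $\beta > r$. Therefore $a \to -\infty$ as $p$ goes to infinity. Let $p_0(\epsilon) \in \mathbb{N}$ 
be such that $a < 0$ for all $p \ge p_0(\epsilon)$. Then 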

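$$\begin{aligned}
\Pr(D_z / p^{1-\beta} > \epsilon) &= \Pr(D_z - E[D_z] > a) \\
&= 1 - \Pr(D_z - E[D_z] \le a) \\
&\ge 1 - \Pr(|D_z - E[D_z]| \ge -a) \\
&\ge 1 - \frac{\mathrm{Var}(D_z)}{(-a)^2} ,
\end{aligned}$$

where $\mathrm{Var}(D_z) = p(1 - p^{-\beta}) q_z (1 - q_z)$ is the variance of $D_z$ and the last step uses Chebyshev's inequality. 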

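Recalling that $p \ge p_0(\epsilon)$ we can examine the last term in the above expression easily. 

$$\begin{aligned}
1 - \frac{\mathrm{Var}(D_z)}{(-a)^2} &\ge 1 - \frac{p(1 - p^{-\beta}) q_z}{a^2} \\
&= 1 - (1 - o(1)) \frac{p^{1-r}}{\sqrt{4\pi r \log p}} \cdot \frac{4\pi r \log p}{p^{2(1-r)}} \\
&= 1 - (1 - o(1)) \frac{\sqrt{4\pi r \log p}}{p^{1-r}} \to 1 ,
\end{aligned}$$

as $p \to \infty$. Therefore we conclude that, for $\tau = \sqrt{2r \log p}$, $D_z / p^{1-\beta}$ does not converge in probability to zero, and 
therefore $\mathrm{FDP}(\hat{S})$ also does not converge to zero. 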

The above result means that a necessary condition for the convergence of $\mathrm{FDP}(\hat{S})$ to zero is that for all but 
finitely many $p$ 

$$\tau > \sqrt{2r\log p} .$$

This, together with (12), shows that there is no thresholding procedure capable of controlling both the false-discovery 
and non-discovery proportions when $r < \beta$, as we wanted to show, concluding the proof. 
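
To see numerically why the false discoveries cannot be controlled at the threshold $\tau = \sqrt{2r\log p}$ when $r < \beta$, one can evaluate the expected number of false discoveries relative to the number of signal components, $E[D_z] / p^{1-\beta}$, for growing $p$. This is a sketch under assumed values $\beta = 1/2$ and $r = 1/4$; the function names are hypothetical.

```python
import math


def gaussian_tail(t):
    """Exact Pr(Z > t) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def expected_false_ratio(p, beta, r):
    """E[D_z] / p^(1 - beta) at the threshold tau = sqrt(2 r log p)."""
    tau = math.sqrt(2.0 * r * math.log(p))
    expected_false = p * (1.0 - p ** (-beta)) * gaussian_tail(tau)
    return expected_false / p ** (1.0 - beta)


# Assumed example: beta = 1/2 and r = 1/4, so that r < beta.
ratios = [expected_false_ratio(2.0 ** e, beta=0.5, r=0.25) for e in (10, 20, 30, 40)]
print(ratios)
```

The ratio behaves like $p^{\beta - r} / \sqrt{4\pi r \log p}$ and therefore grows without bound, in line with the conclusion that $D_z / p^{1-\beta}$ does not converge to zero.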

Appendix B 
Auxiliary Material 

Lemma B.1. Let $0 \le f(p) \le 1/2$ and $g(p) \ge 0$ be any sequences in $p$ such that $\lim_{p \to \infty} f(p) g(p) = 0$. Then 

$$\lim_{p \to \infty} (1 + f(p))^{g(p)} = \lim_{p \to \infty} (1 - f(p))^{g(p)} = 1 .$$

Proof: To establish that $\lim_{p \to \infty} (1 + f(p))^{g(p)} = 1$ note that 

$$1 \le (1 + f(p))^{g(p)} = \exp\left( g(p) \log(1 + f(p)) \right) \le \exp\left( g(p) f(p) \right) ,$$

where the last inequality follows from $\log(1 + x) \le x$ for all $x \ge 0$. As $g(p) f(p) \to 0$ we conclude that 
$\lim_{p \to \infty} (1 + f(p))^{g(p)} = 1$. 

The second part of the result is established in a similar fashion. Note that 

$$\log(1 - f(p)) = -\log\left( \frac{1}{1 - f(p)} \right) = -\log\left( 1 + \frac{f(p)}{1 - f(p)} \right) \ge -\frac{f(p)}{1 - f(p)} \ge -2 f(p) ,$$

where the last inequality relies on the fact that $f(p) \le 1/2$. Using this fact we have that 

$$1 \ge (1 - f(p))^{g(p)} = \exp\left( g(p) \log(1 - f(p)) \right) \ge \exp\left( -2 f(p) g(p) \right) .$$

Taking into account that $g(p) f(p) \to 0$ establishes the desired result. ■ 
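
The two sandwich bounds in this proof can be checked numerically for a concrete pair of sequences. The choice $f(p) = 1/p$ and $g(p) = \sqrt{p}$, for which $f(p) g(p) = p^{-1/2} \to 0$, is an assumption made purely for illustration, as is the function name `lemma_b1_check`.

```python
import math


def lemma_b1_check(p):
    """Return (1+f)^g, (1-f)^g and the bounds exp(fg), exp(-2fg) for the example sequences."""
    f = 1.0 / p        # assumed example: 0 <= f <= 1/2 for p >= 2
    g = math.sqrt(p)   # assumed example: f * g = p ** -0.5 tends to zero
    plus = (1.0 + f) ** g
    minus = (1.0 - f) ** g
    return plus, minus, math.exp(f * g), math.exp(-2.0 * f * g)


for p in (10 ** 2, 10 ** 4, 10 ** 6):
    plus, minus, upper, lower = lemma_b1_check(p)
    print(p, plus, minus, upper, lower)
```

As $p$ grows, both bounds approach 1 and squeeze the two powers between them.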

Lemma B.2. Let $k = k(p)$ be a positive integer sequence in $p$, and let $g = g(p)$ be a positive nondecreasing 
sequence in $p$. For some fixed $a > 1$ let $\epsilon_j = \epsilon_j(p) \le a^{-j} / g(p)$. If $g(p) \ge a^{-1}(1 + \eta)$, for some fixed $\eta > 0$, then 

$$\lim_{p \to \infty} \prod_{j=1}^{k(p)} (1 - \epsilon_j(p)) > 0 .$$

If, in addition, $g(p)$ is any positive monotone diverging sequence in $p$, then 

$$\lim_{p \to \infty} \prod_{j=1}^{k(p)} (1 - \epsilon_j(p)) = 1 .$$



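Proof: Note that 

$$\begin{aligned}
\log\left( \prod_{j=1}^{k(p)} (1 - \epsilon_j(p)) \right) &= \sum_{j=1}^{k(p)} \log(1 - \epsilon_j(p)) \\
&\ge \sum_{j=1}^{k(p)} \log\left( 1 - \frac{a^{-j}}{g(p)} \right) \\
&\ge -\sum_{j=1}^{k(p)} \frac{a^{-j}/g(p)}{1 - a^{-j}/g(p)} \\
&\ge -\sum_{j=1}^{k(p)} \frac{a^{-j}/g(p)}{1 - a^{-1}/g(p)} \\
&= -\frac{1}{g(p) - a^{-1}} \sum_{j=1}^{k(p)} a^{-j} .
\end{aligned}$$

Now, using the formula for the sum of a geometric series, we have 

$$\sum_{j=1}^{k(p)} a^{-j} = \frac{a^{-1}(1 - a^{-k(p)})}{1 - a^{-1}} ,$$

from which it follows that 

$$\prod_{j=1}^{k(p)} (1 - \epsilon_j(p)) \ge \exp\left( -\frac{1}{g(p) - a^{-1}} \cdot \frac{a^{-1}(1 - a^{-k(p)})}{1 - a^{-1}} \right) .$$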



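Now, assuming only that $g(p) \ge a^{-1}(1 + \eta)$, for some fixed $\eta > 0$, it is easy to see that 

$$\lim_{p \to \infty} \prod_{j=1}^{k(p)} (1 - \epsilon_j(p)) > 0 ,$$

and if $g(p) \to \infty$ as $p \to \infty$ we have 

$$\lim_{p \to \infty} \prod_{j=1}^{k(p)} (1 - \epsilon_j(p)) = 1 ,$$

as claimed. 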
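
The lower bound obtained from the geometric series can be compared numerically with the product itself. The sketch below takes the extreme case $\epsilon_j = a^{-j}/g$ with constant $g$ and the assumed values $a = 2$ and $k = 30$; the function name `product_and_bound` is hypothetical.

```python
import math


def product_and_bound(a, g, k):
    """Product of (1 - a^-j / g) over j = 1..k and its exponential lower bound."""
    eps = [a ** (-j) / g for j in range(1, k + 1)]
    prod = math.prod(1.0 - e for e in eps)
    # Geometric series: sum_{j=1}^{k} a^-j = a^-1 (1 - a^-k) / (1 - a^-1).
    geometric = (1.0 / a) * (1.0 - a ** (-k)) / (1.0 - 1.0 / a)
    bound = math.exp(-geometric / (g - 1.0 / a))
    return prod, bound


# g = a^-1 (1 + eta) with eta = 1: the product stays bounded away from zero.
print(product_and_bound(a=2.0, g=1.0, k=30))
# Large g: both the product and the bound are close to one.
print(product_and_bound(a=2.0, g=1000.0, k=30))
```

In the first case the bound is close to $e^{-2}$ and the product stays above it; in the second case both quantities are within about $10^{-3}$ of 1, matching the two conclusions of the lemma.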